Circuit Breaker Pattern

Let's learn about how circuit breakers help keep our services available.

Introduction #

Circuit breakers are a staple in every modern house. The electricity in our houses comes through the main grid and flows through the circuit breakers. There’s a chance that the grid might behave abnormally once in a while, causing an electrical surge that the wiring of our house might not be able to tolerate. Circuit breakers help us prevent the above scenario, protecting our wiring and appliances by switching them off if they detect an abnormally high amount of power.

The circuit breaker pattern that’s often used in API design functions quite similarly to an electrical circuit breaker. It acts as a protective layer for our APIs, preventing them from receiving more requests than they can handle. It’s an effective method for increasing the availability of our services by detecting and potentially preventing cascading failures. It allows us to create a fault-tolerant system that can survive when key services are unavailable.

When deploying a service, clients are often assured that the application will be available 99.999% of the time, which allows for only 0.001% downtime. Let’s take a look at some calculations to see exactly what that entails:

24 hours/day × 365 days = 8760 hours/year

8760 hours/year × 60 minutes/hour = 525600 minutes/year

525600 minutes/year × 0.001/100 = 5.256 minutes/year

A downtime of 0.001% equals 5.256 minutes. This means that our service will be down for around 5 minutes a year. Now, 5 minutes in a year doesn’t sound like much, but there’s something we haven’t considered yet.

An application can depend on hundreds of microservices running to handle all the tasks the application needs to carry out. Let’s suppose that our application has 250 microservices running, each with a downtime of 5.256 minutes a year. Assuming each of these 250 services can fail at different times, and the failure of any one of the services means failure of the overall service, then the overall downtime will be as follows:

(5.256 minutes × 250) / 60 minutes per hour = 21.9 hours

So, if we have 250 microservices, even with a downtime of merely 0.001% each, we have a downtime of nearly 22 hours per year. Such high numbers are not acceptable, and that’s where patterns like circuit breaking shine.
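The arithmetic above is easy to verify with a few lines of code:

```python
# Availability math for a 99.999% ("five nines") SLA, as derived above.
HOURS_PER_YEAR = 24 * 365                 # 8760 hours/year
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60    # 525600 minutes/year

# 0.001% downtime per service, expressed in minutes per year.
downtime_per_service = MINUTES_PER_YEAR * 0.001 / 100
print(round(downtime_per_service, 3))     # 5.256 minutes/year

# Worst case for 250 services whose failures never overlap and where
# any single failure takes down the overall service.
total_downtime_hours = downtime_per_service * 250 / 60
print(round(total_downtime_hours, 1))     # 21.9 hours/year
```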

The circuit breaker pattern#

The circuit breaker pattern is straightforward and has three states in its life cycle:

  • Closed: This is considered to be the normal state. In the closed state, all the calls being made to the service pass through, and the number of failures is monitored.

In the closed state of a circuit breaker, the service functions normally and entertains the requests of the client
  • Open: If the microservice experiences slowness, the circuit breaker begins to see failures for requests to that service. Once the number of failures passes a certain limit, the breaker goes into the open state and activates a timeout for the service. In this state, any requests sent to the microservice are stopped by the circuit breaker immediately. Instead, the circuit breaker can send a response to the client informing it that the microservice has timed out and, if an alternative service is available, advising the client to send its requests there. The client can then either redirect to a different instance of the service or wait until the service is back online. This gives the microservice some time to recover by eliminating the load it has to handle. Once the timeout has expired, the breaker passes to the half-open state.

Note: When the circuit breaker is in the open state, it prevents the requests from reaching the microservice. This is not a failure of the service but rather an interception by the circuit breaker to give the downed microservice time to recover. Without this intervention, microservices can fail in a cascading fashion, and rebooting such a complex system is often time consuming. Our overall service should be designed in such a way that if a few dependent services are temporarily unavailable, the overall service degrades gracefully instead of suffering a total outage.

In the open state of a circuit breaker, the breaker does not allow any requests to go through to the service
  • Half-open: In this state, a limited number of requests are allowed to pass through to the microservice to test whether the underlying issue persists. If even a single call allowed through to the microservice fails in the half-open state, the breaker trips and goes back to the open state. However, if all the calls sent succeed, the breaker resets to the closed state and begins operating normally.

If the requests fail in the half-open state, then the circuit breaker reverts back to the open state

If the calls succeed, then the circuit breaker updates the status to closed

Note: A request is considered failed if the microservice isn't able to respond within a specific time defined for that service. A rejection from the circuit breaker is not considered a service failure. In the half-open state, the circuit breaker rejects some requests itself and allows a limited number through to the microservice to test its functionality.
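One common way to implement the "no response within the deadline counts as a failure" rule is to run the call with a timeout. The sketch below is illustrative: the `slow_service` stub and the 300 ms deadline are assumptions, not part of any particular framework.

```python
import concurrent.futures
import time

DEADLINE = 0.3  # a hypothetical 300 ms failure threshold


def slow_service():
    """Stub standing in for a microservice call that has become slow."""
    time.sleep(0.5)
    return "payload"


def call_with_deadline(fn, deadline):
    """Return (succeeded, result). A call that misses the deadline
    counts as a failure, even though it may eventually complete."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return True, future.result(timeout=deadline)
        except concurrent.futures.TimeoutError:
            return False, None


ok, result = call_with_deadline(slow_service, DEADLINE)
print(ok)  # False: the call took 500 ms, past the 300 ms deadline
```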

Here, we depict the lifecycle of a circuit breaker and the events that trigger its different states:

The lifecycle of a circuit breaker
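The lifecycle above can be sketched as a small state machine. This is a minimal, illustrative sketch (the class name, thresholds, and timeout values are all assumptions, not from any particular library); production implementations also track failure rates over sliding windows and handle concurrency.

```python
import time


class CircuitBreaker:
    """Minimal three-state circuit breaker (illustrative, not production-ready)."""

    def __init__(self, failure_limit=3, recovery_timeout=30.0, half_open_probes=2):
        self.failure_limit = failure_limit        # failures that trip the breaker
        self.recovery_timeout = recovery_timeout  # seconds to stay open
        self.half_open_probes = half_open_probes  # trial calls allowed when half-open
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.probes_left = 0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"          # timeout expired: allow trial calls
                self.probes_left = self.half_open_probes
            else:
                raise RuntimeError("circuit open: request rejected without calling service")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half-open":
            self._trip()                          # any half-open failure re-opens
        else:
            self.failures += 1
            if self.failures >= self.failure_limit:
                self._trip()

    def _on_success(self):
        if self.state == "half-open":
            self.probes_left -= 1
            if self.probes_left <= 0:             # all trial calls succeeded
                self._reset()
        else:
            self.failures = 0

    def _trip(self):
        self.state = "open"
        self.opened_at = time.monotonic()

    def _reset(self):
        self.state = "closed"
        self.failures = 0
```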

Now that we know what circuit breakers are and how they work, let's take a look at some scenarios where we can take advantage of circuit breakers.

Example scenario#

Let's suppose we have an application with five different services. When a service gets a request, the application allocates a process to call that service. Any of these services may fail for a variety of reasons, such as high latency. This is especially problematic if the service being called is a high-demand service because it gets more requests. As a result, we’ll have to allocate more processes, and all of these processes will be stuck waiting to call and get a response from the service.

Now, if the majority of our processes are occupied by this one service, that would leave only a few processes for the other services. This leads to the possibility of the leftover processes being occupied by the remaining services and, in turn, blocking all the processes of the application. The requests, however, will not stop coming and will add up until the processes are unblocked. Even after the service recovers, the processes will be busy processing the requests that queued up while the service was unavailable. Before long, it might lead to cascading failures throughout the application.

The scenario above has been illustrated in the following slides:

A request is made for Service A and a process is assigned to it

There are several requests for Service A because it’s a high-demand service

Service A goes down for some reason and all the processes assigned to it become blocked as they’re waiting for a response

The remaining processes are assigned to entertain the other incoming requests

The requests begin to pile up because all the processes are either busy or blocked

A scenario like this is perfect for demonstrating the utility of the circuit breaker pattern. First, we’ll have to define a failure threshold for our services. For our case, let's assume it to be 300 ms. That is, if a service is taking longer than 300 ms to respond, we’ll consider it to have failed.

Let's go through this process and see how adding a circuit breaker affects our scenario:

  • Normally, the circuit breaker is in the closed state, and all requests go through to the service.

  • If a significant portion of these requests, let’s say 50%, exceeds the failure threshold we defined previously (taking longer than 300 ms to be served), the breaker assumes that the service is unresponsive and will “trip” and go into the open state. The breaker then sends a message to the client, and the client can either wait for the service to become responsive or redirect to another instance. This prevents requests from queuing up and frees up processes because they’re no longer blocked by an unresponsive service.

  • After the timeout expires, the breaker will move to the half-open state and allow a fraction of the total requests to go through to its corresponding service.

  • Let's say that the service normally gets 100 requests per second. In the half-open state, the circuit breaker would allow 25 of these requests to pass to the service. If all these requests succeed, then the breaker assumes the service is functioning properly and moves to the closed state so that the service can carry on as usual. However, if even one of the requests in the half-open state fails, then the breaker reverts to the open state and the timeout begins again.
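The 50% rule from this walkthrough can be sketched as a counter over the most recent calls. The window size and trip threshold here are illustrative choices, and the class name is hypothetical:

```python
from collections import deque


class FailureRateTracker:
    """Tracks outcomes of the last `window` calls and reports when the
    failure rate crosses the trip threshold (50% in this lesson's example)."""

    def __init__(self, window=4, threshold=0.5):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, succeeded):
        self.outcomes.append(succeeded)

    def should_trip(self):
        if not self.outcomes:
            return False
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        return failure_rate >= self.threshold


tracker = FailureRateTracker(window=4, threshold=0.5)
# Outcomes mirroring the slides: two successes, then two failures.
for outcome in (True, True, False, False):
    tracker.record(outcome)
print(tracker.should_trip())  # True: the breaker would move to the open state
```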

This process has been illustrated by the slides below:

A process sends a successful request to Service B and gets a response

A request is sent to Service A and is successful

Two more requests are sent to Service A. One is successful while the other is not

Another request is sent to Service A, which also fails, bringing the failure percentage to 50% over some time period

The circuit breaker for Service A goes into the open state and initiates a failure timeout. Any requests to Service A are stopped by the circuit breaker, which sends the appropriate response to the client

Once the failure timeout is over, the circuit breaker for Service A goes into the half-open state and allows a limited number of requests. Any extra requests are dropped immediately

All the requests during the half-open state are successful, so the circuit breaker returns to the closed state

Note: In the slides above we have abbreviated "circuit breaker" to CB.

In the slides above, we have a system with seven processes calling the available services. When 50% of the requests going to Service A end in failure, the circuit breaker attached to it goes into the open state and activates a timeout period for the service, and any subsequent requests to that service will immediately end in failure. Immediate failure frees up the processes to take care of other requests while giving time to Service A to recover.

After the timeout of the open state is over, the circuit breaker switches to the half-open state. In this state, the circuit breaker allows a few requests to pass to the service; in our case, we allow two out of three requests to go through. Because all of them are successful, the circuit breaker reverts to the closed state, and Service A returns to functioning normally.

Cascading failures scenario#

A cascading failure happens when the failure of one service affects the performance of the services dependent on it, causing multiple services in the system to fail.

A client is calling Services 1 and 2, which are dependent on other services

Service 5 fails, and Services 3 and 4 are left waiting for a response

Services 3 and 4 also fail because they have been kept waiting for too long

Services 1 and 2 then fail as well because they were waiting on Services 3 and 4, leaving the client waiting for a response

In the slides above, we have a system with interconnected services. If one fails, the rest are at risk of failure as well, causing a cascading failure of the system. The failed service is unable to respond to requests, so the dependent services are kept waiting too long and eventually fail because they don’t receive a response in time.

The circuit breaker pattern can help us in this regard by adding an extra layer that calls between services have to go through. This allows calls to fail fast when a service is unresponsive and protects the clients from failing along with the unresponsive target service.
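One way to sketch that extra layer is a wrapper that consults the breaker's state and returns a fallback immediately instead of waiting on the dependency. All the names below (`with_fallback`, `fetch_recommendations`, the pretend breaker state) are hypothetical:

```python
def with_fallback(call, is_open, fallback):
    """Extra layer in front of a dependency: if its circuit breaker is open,
    fail fast with a fallback instead of waiting on an unresponsive service."""
    if is_open():
        return fallback          # the caller is never kept waiting
    return call()


# Pretend the breaker for the downstream service has tripped.
breaker_open = lambda: True


def fetch_recommendations():
    """Hypothetical downstream call that would otherwise block and time out."""
    raise TimeoutError("downstream service unresponsive")


result = with_fallback(fetch_recommendations, breaker_open, fallback=[])
print(result)  # []: the caller degrades gracefully instead of failing in cascade
```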

Let's visualize how circuit breakers can help us fix or mitigate the issue of cascading failures:

Service 3 gets a successful response from Service 6, but Services 4 and 5 do not

Service 6 exceeds the failure threshold, so the circuit breaker moves to the open state and any requests to Service 6 now fail immediately. Services 3, 4, and 5 quickly learn of Service 6’s failure and aren’t kept waiting

After the timeout is over, the circuit breaker moves into the half-open state and only allows two of the three requests to pass through

Because all the requests forwarded during the half-open state succeeded, the circuit breaker reverts to the closed state

In the slides above, we have a system of interconnected services, each of which has its own circuit breaker. If Service 6 crosses the failure rate threshold, the circuit breaker attached to it switches to the open state. As a result, any requests to Service 6 are stopped by the circuit breaker, which sends a response to the requesting clients, allowing them either to wait or to be redirected to another instance of the service. This contains the failure to one service because the other services immediately receive failure responses and can move on to other tasks.

As mentioned in the previous section, the open-state timeout of the circuit breaker gives the service time to recover. Once the timeout is over, the breaker goes into the half-open state. In this state, a fraction of the usual requests (in our case, two out of three) are sent to the service. If all of them succeed, the circuit breaker goes into the closed state, and the service starts functioning normally again.

Netflix uses this pattern in their product to make it more resilient and fault tolerant.

They have developed the Hystrix framework, which is based on the circuit breaker pattern we have been studying in this lesson. It’s available for public use and can be added to existing applications fairly easily. If you’re interested, then it’s worth checking out for some more details, but it’s out of the scope of this course.

Summary #

In this lesson, we learned that the circuit breaker pattern is an extremely useful technique to include in our API design strategies. The purpose of this pattern is to allow services to "fail fast and recover ASAP." This pattern helps us build resilient systems that protect us from cascading failures and keep our client processes from being blocked while waiting for a response from a failed service.

Quiz

Question

On the surface, it might seem like circuit breakers have a similar function to rate limiters. Why should we use one over the other?


While they may seem similar on the surface because both are used to limit calls to an API or a service, upon further exploration, we can see that they have two different use cases.

In general, rate limiters are not very complex. They restrict the calls being made to a service in a specific period of time. They do this by monitoring the number of calls being made to a service. If the number of calls in the specified time frame exceeds the limit set by the system, then the rate limiter begins a timeout, during which any calls to the service are dropped. Once the timeout is over, the service returns to normal functioning. The rate limiter essentially helps us protect the server from overloading by controlling the throughput.
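A minimal fixed-window limiter makes the contrast concrete; notice that nothing in it inspects whether calls succeed or how healthy the service is. The class name and limits below are illustrative:

```python
import time


class FixedWindowRateLimiter:
    """Minimal fixed-window rate limiter: it only counts calls per time
    window and knows nothing about the downstream service's health."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now      # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False                     # over the limit: drop the call


limiter = FixedWindowRateLimiter(limit=2, window_seconds=1.0)
print([limiter.allow() for _ in range(3)])  # [True, True, False]
```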

One of the main differences between circuit breakers and rate limiters is that circuit breakers ensure the failure remains isolated to one component and are used to keep the client service safe when the target service is unresponsive. Circuit breakers are smarter and more resilient than rate limiters because they’re able to detect failures and shut off access to the failed service, while rate limiters do no such thing. Therefore, circuit breakers are preferred for more complex systems, such as one with cascading services.

Circuit breakers are also concerned with the health of the service and the half-open state of the circuit breaker is there to check if the service is healthy enough to function properly. On the other hand, the rate limiter is not concerned with the health of the service. It only limits the number of requests made to a service.
